To start working on the homework we have download the stocks data. For this we have first created the CSV file named as stocks.csv which simply contains the names of the comapnies for those we have to collect the data. To do this we have also focused on collecting the 10 companies data in each category discussed in homework description. We have also noticed that instead of 10 categories there are 11 categories. Also for this part we have focused on collecting the data from January 1, 2003 through January 1, 2008.
This makes us total of 110 companies categorized into 11 Global Industry Classification Standard (GICS).
Lets assume we have already all the packages. I will include them here.
require(tseries, warn.conflicts=F, quietly = TRUE)
require(igraph, warn.conflicts=F, quietly = TRUE)
require(lemon,warn.conflicts=F, quietly = TRUE)
require(GGally,warn.conflicts=F, quietly = TRUE)
After that I will just read the csv file which will have rows. I will show rows that it contains. I also have to use the lemon library to pretty print the dataframe values.
stock <- read.csv("stocks.csv")
knitr::kable(head(stock), format="markdown")
| Symbol | Security | GICSSector | GICS.Sub.Industry | Location | Date.first.added | CIK | Founded |
|---|---|---|---|---|---|---|---|
| A | Agilent Technologies Inc | Health Care | Health Care Equipment | Santa Clara, California | 6/5/2000 | 1090872 | 1999 |
| UNP | Union Pacific | Industrials | Railroads | Omaha, Nebraska | 100885 | ||
| AAP | Advance Auto Parts | Consumer Discretionary | Automotive Retail | Roanoke, Virginia | 7/9/2015 | 1158449 | 1932 |
| AAPL | Apple Inc. | Information Technology | Technology Hardware, Storage & Peripherals | Cupertino, California | 11/30/1982 | 320193 | 1977 |
| ABC | AmerisourceBergen Corp | Health Care | Health Care Distributors | Chesterbrook, Pennsylvania | 8/30/2001 | 1140859 | 1985 |
| ABMD | ABIOMED Inc | Health Care | Health Care Equipment | Danvers, Massachusetts | 5/31/2018 | 815094 | 1981 |
From above we can see that Symbol column contains the companies name and GICSSector contains the category. Now I will download the data from January 1, 2003 through January 1, 2008.
symbols <- stock[["Symbol"]]
data <- NULL
options("getSymbols.yahoo.warning"=FALSE)
options("getSymbols.warning4.0"=FALSE)
for (row in symbols) {
if (is.null(data)) {
data <- suppressWarnings(get.hist.quote(instrument = row, start = "2003-01-01", end = "2008-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
}
else {
temp <- suppressWarnings(get.hist.quote(instrument = row, start = "2003-01-01", end = "2008-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
data <- cbind(data, temp)
}
}
cMatrix <- as.matrix(data)
After downloading the data in cMatrix we have to assign the names to columns and we have to remove the na values and infinite values. We are replacing them with 0 instead of removing the whole row. For getting required X matrix we also have to use diff and log formula on our matrix. I will be outputing its first few rows and columns using kable function.
cMatrix <- as.matrix(data)
colnames(cMatrix) <- symbols
cMatrix[is.na(cMatrix)] <- 0
logMatrix <- diff(log(cMatrix))
logMatrix[is.na(logMatrix)] <- 0
logMatrix[is.infinite(logMatrix)] <- 0
knitr::kable(head(logMatrix[,1:20],6), format="markdown")
| A | UNP | AAP | AAPL | ABC | ABMD | ABT | ACN | ADBE | ADI | ADM | ADP | ADS | ADSK | AEE | AEP | AES | AFL | AGN | AIG | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2003-01-03 | -0.0047133 | -0.0065585 | -0.0155700 | 0.0067342 | 0.0149835 | 0.0231076 | 0.0096908 | 0.0074946 | 0.0269766 | 0.0042381 | 0.0063694 | -0.0056783 | -0.0027816 | -0.0284840 | 0.0136249 | 0.0200253 | 0.0628009 | 0.0003165 | 0.0193509 | -0.0029895 |
| 2003-01-06 | 0.0466631 | 0.0179333 | -0.0659116 | 0.0000000 | 0.0118948 | 0.0275362 | 0.0064087 | 0.0184946 | 0.0449806 | 0.0596970 | 0.0015860 | 0.0027197 | -0.0168544 | 0.0284840 | 0.0418107 | 0.0590889 | 0.0000000 | 0.0125788 | 0.0129210 | 0.0330474 |
| 2003-01-07 | -0.0090589 | 0.0001616 | 0.0056436 | -0.0033619 | -0.0260110 | -0.0849932 | -0.0460053 | -0.0195618 | 0.0357053 | 0.0046974 | -0.0143658 | -0.0094270 | 0.0056497 | 0.0422456 | -0.0350673 | -0.0175298 | -0.0234615 | -0.0084733 | -0.0033841 | -0.0196643 |
| 2003-01-08 | -0.0497512 | -0.0130083 | -0.0062969 | -0.0204083 | -0.0372670 | -0.0300159 | 0.0248996 | -0.0069649 | -0.0496148 | -0.0266682 | -0.0072610 | -0.0085107 | -0.0028208 | -0.0105612 | 0.0018522 | -0.0060240 | -0.0059524 | 0.0022037 | -0.0164050 | -0.0125517 |
| 2003-01-09 | 0.0390776 | 0.0105821 | 0.0140589 | 0.0088943 | -0.0128590 | 0.2314195 | 0.0032570 | -0.0124426 | 0.0400993 | 0.0484199 | 0.0128723 | 0.0159604 | 0.0168071 | 0.0138388 | -0.0224571 | 0.0070246 | 0.0409414 | 0.0125002 | 0.0027529 | 0.0363914 |
| 2003-01-10 | 0.0167220 | -0.0035691 | -0.0032269 | 0.0027219 | -0.0113658 | -0.0968498 | -0.0050151 | -0.0822226 | 0.0144140 | 0.0205977 | -0.0088319 | -0.0106952 | 0.0456105 | -0.0245127 | -0.0066461 | -0.0537525 | -0.0379608 | -0.0125002 | -0.0093217 | -0.0095002 |
After that I will plot logMatrix which will be required X matrix.
plot(logMatrix)
Above completes point 2 part of the homework in Your Job section.
Lets move to point 3.
For this we have to calculate Pearson correlation coefficient between stocks. For this we have to simply use cor function.
pearson_correlation_matrix <- cor(logMatrix, method = "pearson", use = "everything")
R <- pearson_correlation_matrix
R[is.na(R)] <- 0
knitr::kable(head(R[,1:20],6), format="markdown") # outputing the rows
| A | UNP | AAP | AAPL | ABC | ABMD | ABT | ACN | ADBE | ADI | ADM | ADP | ADS | ADSK | AEE | AEP | AES | AFL | AGN | AIG | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 1.0000000 | 0.3132293 | 0.2453006 | 0.2894047 | 0.2060193 | 0.1929294 | 0.2173697 | 0.2595466 | 0.3440663 | 0.4530926 | 0.1865611 | 0.3108620 | 0.2168602 | 0.2993274 | 0.2265353 | 0.2792155 | 0.2083448 | 0.2708431 | 0.2246741 | 0.3329415 |
| UNP | 0.3132293 | 1.0000000 | 0.2409231 | 0.2394450 | 0.2275416 | 0.1805622 | 0.2706303 | 0.2816294 | 0.2722011 | 0.2881027 | 0.2555853 | 0.3005328 | 0.2097893 | 0.2803825 | 0.2796656 | 0.2944920 | 0.2480392 | 0.2920236 | 0.1999380 | 0.3494798 |
| AAP | 0.2453006 | 0.2409231 | 1.0000000 | 0.1999078 | 0.1331749 | 0.1000599 | 0.1607918 | 0.1810424 | 0.1746663 | 0.2027267 | 0.1106188 | 0.1685443 | 0.1931692 | 0.1736631 | 0.1184338 | 0.1588764 | 0.1793692 | 0.2229857 | 0.1487453 | 0.2467757 |
| AAPL | 0.2894047 | 0.2394450 | 0.1999078 | 1.0000000 | 0.1270099 | 0.0927325 | 0.1895600 | 0.1360781 | 0.3016231 | 0.2960004 | 0.1349346 | 0.2564773 | 0.2002367 | 0.2784136 | 0.1937273 | 0.2194870 | 0.1332074 | 0.2081890 | 0.2015270 | 0.2546643 |
| ABC | 0.2060193 | 0.2275416 | 0.1331749 | 0.1270099 | 1.0000000 | 0.1144457 | 0.2634622 | 0.1995059 | 0.1942354 | 0.1777007 | 0.1762504 | 0.2228011 | 0.1627588 | 0.1479000 | 0.2147295 | 0.2136665 | 0.1685642 | 0.1420544 | 0.2784182 | 0.2002196 |
| ABMD | 0.1929294 | 0.1805622 | 0.1000599 | 0.0927325 | 0.1144457 | 1.0000000 | 0.1759887 | 0.1675586 | 0.1482111 | 0.1929230 | 0.0936166 | 0.1393736 | 0.1013807 | 0.1779248 | 0.1643561 | 0.2037472 | 0.1463400 | 0.1504244 | 0.1378275 | 0.1819907 |
From above we can see that columns and rows that have same value have higher dependency on each other usaully 1 as you can see the first row and first column value of “A” company.
Now lets move to calculating Marginal Correlation Graphs.. But first we have to calculating the bootstrap confidence interval for this we have used our own algorithm instead of using Boot function.
#Calculating bootstarp confidence interval
B = 1000 # considering only 1000 iteration for sampling purposes
brep = rep(NA, B)
for (b in 1:B){
row_idx <- sample(1:nrow(logMatrix), replace = T)
bSample <- logMatrix[row_idx,] # bootstrap sample
rownames(bSample) <- rownames(logMatrix)
sample_Cor <- cor(bSample, method = "pearson", use = "everything") # getting correlation from original matrix for sample
sample_Cor[is.na(sample_Cor)] <- 0 # replacing values
btheta <- sqrt(nrow(logMatrix))*max(sample_Cor-R) # bootstrap replicate
brep[b] <- btheta # save
}
After above we have our bootstrapped replicated values. Till now we have ∆b After we will calculate the our Emperical CDF.
Gstar<- ecdf(brep)
plot(Gstar)
From above you can see that values graph changes between 0 to 1. Now I have Fn(t) hat. I am going to move t alpha based on 0.95 confidence.
confidence <- as.numeric(quantile(brep,0.95)/sqrt(nrow(logMatrix)))
plot(confidence)
From above you can notice that its value is 0.18. Lets move to building PR[j,k]∈Cj,k(α) for all (j,k) →n 1−α for this I need build new metrix based on this condition “If we have a confidence interval Cn,α then we can put an edge whenever [−ε, +ε] ∩ Cn,α = ∅.”
I am going to a function which will create matrix for edges if based on our confidence interval and error as ee = 0.1
matrix_apply <- function(m,ee,con) {
m2 <- m
for (r in seq(nrow(m2))){
for (c in seq(ncol(m2))){
l <- m[[r,c]]-con
h <- m[[r,c]]+con
eel <- ee*-1
eeh <- ee
if(h <= eel || l >= eeh){
m2[[r,c]]<-1
}
else{
m2[[r,c]]<-0
}
}
}
return(m2)
}
ee<-0.13
m<-matrix_apply(R,ee,confidence)
knitr::kable(head(m[,1:10],6), format="markdown")
| A | UNP | AAP | AAPL | ABC | ABMD | ABT | ACN | ADBE | ADI | |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| UNP | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| AAP | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| AAPL | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| ABC | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| ABMD | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
From above we have 1 in the matrix when edge is possible and 0 otherwise. Moving to creating a graph from adjacency matrix as m and simplify it if there are some stocks which points to itself. But before doing I need a function which will color the stocks which has same GICS .
color_graph <- function(graph,stocks){
sectors <- unique(stocks$GICSSector) # getting GICS
colors <- rainbow(length(sectors)) # getting colors based on GICS
for(i in 1:nrow(stocks)) {
row <- stocks[i,]
#assigning the colors
for(j in 1:length(sectors)) {
if(sectors[j] == row$GICSSector){
V(graph)[i]$color <- colors[j] # assigning color to each stock but assign same color which are in same sector
break
}
}
}
return (graph)
}
Lets plot the graph after using above function for coloring.
ee<- 0.13
m<- matrix_apply(R,ee,confidence)
g<-simplify(graph_from_adjacency_matrix(m,diag = FALSE,mode = "undirected"))
plot(color_graph(g,stock))
From above you can see that with error 0.13 we get the perfect result of what we have wanted to support our claim that “stocks from the same sector cluster together”. To support this claim I have drawn the graph containing unique color for stocks that are in the same GICS
We will going to run this with different error values to see what will happen to our stocks when we increase error value ee
ee<- 0.02
for( i in 1:20){
ee <-ee + 0.02
m<- matrix_apply(R,ee,confidence)
g<-simplify(graph_from_adjacency_matrix(m,diag = FALSE,mode = "undirected"))
plot(color_graph(g,stock), main=c("Error value as ",ee))
}
From above you can see that our confidence value wa around 0.18 initailly and as we increase our error value from 0.02 to 0.42 stocks from same GICS started spreading instead of merging on same clustor. But still you can see that stocks from same GICS always clustered togethor.